In this exercise, we will require the tidyverse,
gt and janitor packages.
library(tidyverse)
library(gt)
library(janitor)
The goal of this exercise is to produce some data summaries and get
some experience using pipes (%>%). The Metropolitan
Museum of Art in New York City maintains a database of more than 470,000
artworks. For the purposes of this exercise, we are going to focus on a
small number of European paintings.
The file is called MetEuro.csv and you can read the file in using the read_csv function. Choose met_euro as the name if you want to be consistent with the solutions we provide. Once you have read the file in as a data frame, you will find it in the Environment tab in RStudio and can explore it more there.
met_euro <- read_csv("MetEuro.csv")
Rows: 15 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): Department, Object_Title, Artist_Name, Artist_Nationality, Medium
dbl (2): Artist_Birth_Year, Object_Age
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
There are 15 rows, and therefore 15 artworks, in the file. Note that the age of each object (Object_Age) is given in years.
Use tabyl to look at the Artist_Nationality variable. Recall that gt() produces nicely formatted tables.
met_euro %>%
  tabyl(Artist_Nationality) %>%
  gt()
| Artist_Nationality | n | percent |
|---|---|---|
| British | 2 | 0.13333333 |
| Dutch | 4 | 0.26666667 |
| French | 6 | 0.40000000 |
| German | 1 | 0.06666667 |
| Netherlandish | 1 | 0.06666667 |
| Swedish | 1 | 0.06666667 |
As the frequency table of artist nationalities shows, French is the most common nationality: 6 of the 15 artworks are by French artists.
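The percent column that tabyl reports is a proportion: each count divided by the total number of rows. As a rough sketch of that calculation, the block below rebuilds only the Artist_Nationality column from the counts in the table above and computes the same proportions with count and mutate from dplyr. The data frame name nationality_only is made up for illustration, and the data are not read from MetEuro.csv.

```r
library(dplyr)

# Hypothetical stand-in for met_euro: only the nationality column,
# rebuilt from the counts printed in the table above.
nationality_only <- data.frame(
  Artist_Nationality = rep(
    c("British", "Dutch", "French", "German", "Netherlandish", "Swedish"),
    times = c(2, 4, 6, 1, 1, 1)
  )
)

# count() gives one row per nationality with its frequency in column n;
# dividing by sum(n) turns each frequency into a proportion of all rows.
by_hand <- nationality_only %>%
  count(Artist_Nationality) %>%
  mutate(percent = n / sum(n))

by_hand
```

The proportions should agree with the percent column produced by tabyl, which is why that column needs further formatting before it reads as a percentage.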
You can use the adorn functions, such as adorn_totals and adorn_pct_formatting, to add a totals row and produce actual percentages for your table.
met_euro %>%
  tabyl(Medium) %>%
  adorn_totals("row") %>%
  adorn_pct_formatting() %>%
  gt()
| Medium | n | percent |
|---|---|---|
| Ivory | 2 | 13.3% |
| Oil on canvas | 8 | 53.3% |
| Oil on wood | 1 | 6.7% |
| Pastel | 2 | 13.3% |
| Vellum | 2 | 13.3% |
| Total | 15 | 100.0% |
Oil on canvas is by far the most common medium, used for 8 of the 15 artworks (53.3%).
Later in this course we will learn how to sort rows in a table by, say, the frequency.
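As a preview only, and not necessarily the approach the course will take, one common way to sort a frequency table is arrange together with desc from dplyr. The sketch below rebuilds just the Medium column from the counts in the table above, so medium_only is a made-up name and the data are not read from MetEuro.csv.

```r
library(dplyr)
library(janitor)

# Hypothetical stand-in for met_euro: only the Medium column,
# rebuilt from the counts printed in the table above.
medium_only <- data.frame(
  Medium = rep(
    c("Ivory", "Oil on canvas", "Oil on wood", "Pastel", "Vellum"),
    times = c(2, 8, 1, 2, 2)
  )
)

# tabyl() builds the frequency table; arrange(desc(n)) then orders
# the rows from the most frequent medium to the least frequent.
medium_sorted <- medium_only %>%
  tabyl(Medium) %>%
  arrange(desc(n))

medium_sorted
```

Sorting before any adorn_totals step keeps the Total row from being moved to the top of the table.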
You can use the mean, min and max functions within summarise to obtain summary statistics for the age of the objects.
met_euro %>%
  summarise(Variable = "Age of Object",
            Mean = mean(Object_Age),
            Max = max(Object_Age),
            Min = min(Object_Age)) %>%
  gt()
| Variable | Mean | Max | Min |
|---|---|---|---|
| Age of Object | 218.3333 | 480 | 129 |
The average age of the paintings is about 218 years, with the newest painting 129 years old and the oldest 480 years old.
Here we introduce group_by to our code above to look at the descriptive statistics by medium.
met_euro %>%
  group_by(Medium) %>%
  summarise(Variable = "Age of Object",
            Mean = mean(Object_Age),
            Max = max(Object_Age),
            Min = min(Object_Age)) %>%
  gt()
| Medium | Variable | Mean | Max | Min |
|---|---|---|---|---|
| Ivory | Age of Object | 240.00 | 245 | 235 |
| Oil on canvas | Age of Object | 185.75 | 350 | 129 |
| Oil on wood | Age of Object | 153.00 | 153 | 153 |
| Pastel | Age of Object | 133.50 | 137 | 130 |
| Vellum | Age of Object | 444.50 | 480 | 409 |
The oldest painting (480 years old) was painted on vellum, and the vellum works are also the oldest on average (444.5 years).
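Group sizes matter when reading a table like this: Oil on wood has identical mean, maximum and minimum because it contains a single work. A minimal sketch of adding a count with n() is given below. It uses a partial reconstruction of the data rather than MetEuro.csv: for media with one or two works, the individual ages follow from the Min and Max columns above, while Oil on canvas is left out because its eight ages cannot be recovered from the table. The names partial_ages and Count are made up for illustration.

```r
library(dplyr)

# Partial, hypothetical stand-in for met_euro: ages recovered from the
# Min and Max columns of the grouped table. Oil on canvas is omitted.
partial_ages <- data.frame(
  Medium = c("Ivory", "Ivory", "Oil on wood",
             "Pastel", "Pastel", "Vellum", "Vellum"),
  Object_Age = c(235, 245, 153, 130, 137, 409, 480)
)

# group_by() splits the rows by medium; summarise() then returns one row
# per medium, with n() counting how many works fall in each group.
by_medium <- partial_ages %>%
  group_by(Medium) %>%
  summarise(Count = n(),
            Mean = mean(Object_Age),
            Max = max(Object_Age),
            Min = min(Object_Age))

by_medium
```

With the full met_euro data frame, the same Count = n() line could be added to the summarise call shown earlier.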
Propose some other summaries that might be of interest and provide the code to produce them.
Download the dental decay file used in lectures and attempt to produce your own summaries.
© 2022 Statistical Consulting Centre, The University of Melbourne.